Abstract

The dueling bandit problem is a variation of the classical multi-armed 
bandit in which the allowable actions are noisy comparisons between pairs 
of arms. This paper focuses on a new approach for finding the “best” arm 
according to the Borda criterion using noisy comparisons. We prove that 
in the absence of structural assumptions, the sample complexity of this 
problem is proportional to the sum of the inverse squared gaps between the 
Borda scores of each suboptimal arm and the best arm. We explore this 
dependence further and consider structural constraints on the pairwise 
comparison matrix (a particular form of sparsity natural to this problem) 
that can significantly reduce the sample complexity. This motivates a new
algorithm called Successive Elimination with Comparison Sparsity (SECS)
that exploits sparsity to find the Borda winner using fewer samples than
standard algorithms. We also evaluate the new algorithm experimentally 
with synthetic and real data. The results show that the sparsity model 
and the new algorithm can provide significant improvements over standard 
approaches. 


1 INTRODUCTION 


The dueling bandit is a variation of the classic multi-armed bandit problem in
which the actions are noisy comparisons between arms, rather than observations
from the arms themselves (Yue et al., 2012). Each action provides 1 bit
indicating which of two arms is probably better. For example, the arms could
represent objects and the bits could be responses from people asked to compare
pairs of objects. In this paper, we focus on the pure exploration problem of
finding the “best” arm from noisy pairwise comparisons. This problem is
different from the explore-exploit problem studied in Yue et al. (2012). There can
be different notions of “best” in the dueling framework, including the Condorcet
and Borda criteria (defined below).

Most of the dueling-bandit algorithms are primarily concerned with finding 
the Condorcet winner (the arm that is probably as good or better than every 
other arm). There are two drawbacks to this. First, a Condorcet winner does 
not exist unless the underlying probability matrix governing the outcomes of
pairwise comparisons satisfies certain restrictions. These restrictions may not be 
met in many situations. In fact, we show that a Condorcet winner doesn’t exist 
in our experiment with real data presented below. Second, the best known upper 
bounds on the sample complexity of finding the Condorcet winner (assuming 
it exists) grow quadratically (at least) with the number of arms. This makes 
Condorcet algorithms impractical for large numbers of arms. 

To address these drawbacks, we consider the Borda criterion instead. The 
Borda score of an arm is the probability that the arm is preferred to another arm 
chosen uniformly at random. A Borda winner (arm with the largest Borda score) 
always exists for every possible probability matrix. We assume throughout this 
paper that there exists a unique Borda winner. Finding the Borda winner 
with probability at least $1 - \delta$ can be reduced to solving an instance of the
standard multi-armed bandit problem, resulting in a sufficient sample complexity
of $O\left(\sum_{i>1} (s_1 - s_i)^{-2} \log\left(\log((s_1 - s_i)^{-2})/\delta\right)\right)$, where $s_i$ denotes the Borda score
of arm $i$ and $s_1 > s_2 \geq \cdots \geq s_n$ are the scores in descending order (Karnin
et al., 2013; Jamieson et al., 2014). In favorable cases, for instance, if $s_1 - s_i \geq c$,
a constant, for all $i > 1$, then this sample complexity is linear in $n$ as opposed to
the quadratic sample complexity necessary to find the Condorcet winner. In this 
paper we show that this upper bound is essentially tight, thereby apparently 
“closing” the Borda winner identification problem. However, in this paper we 
consider a specific type of structure that is motivated by its existence in real 
datasets that complicates this apparently simple story. In particular, we show 
that the reduction to a standard multi-armed bandit problem can result in very 
bad performance when compared to an algorithm that exploits this observed 
structure. 

We explore the sample complexity dependence in more detail and consider 
structural constraints on the matrix (a particular form of sparsity natural to 
this problem) that can significantly reduce the sample complexity. The sparsity 
model captures the commonly observed behavior in elections in which there are 
a small set of “top” candidates that are competing to be the winner but only 
differ on a small number of attributes, while a large set of “others” are mostly 
irrelevant as far as predicting the winner is concerned in the sense that they 
would always lose in a pairwise matchup against one of the “top” candidates. 

This motivates a new algorithm called Successive Elimination with Comparison
Sparsity (SECS). SECS takes advantage of this structure by determining
which of two arms is better on the basis of their performance with respect to
a sparse set of “comparison” arms. Experimental results with real data demonstrate
the practicality of the sparsity model and show that SECS can provide
significant improvements over standard approaches.

The main contributions of this paper are as follows: 


• A distribution-dependent lower bound for the sample complexity of identifying
the Borda winner, showing that the Borda reduction
to the standard multi-armed bandit problem (explained in detail later)
is optimal up to logarithmic factors, given no prior structural
information.

• A new structural assumption for the n-armed dueling bandits problem in 
which the top arms can be distinguished by duels with a sparse set of 
other arms. 

• An algorithm for the dueling bandits problem under this assumption, with
theoretical performance guarantees showing significant sample complexity
improvements compared to naive reductions to standard multi-armed bandit
algorithms.

• Experimental results, based on real-world applications, demonstrating the 
superior performance of our algorithm compared to existing methods. 


2 PROBLEM SETUP 


The n-armed dueling bandits problem (Yue et al., 2012) is a modification of the
n-armed bandit problem, where instead of pulling a single arm, we choose a pair
of arms $(i, j)$ to duel, and receive one bit indicating which of the two is better or
preferred. The probability of $i$ winning the duel is equal to a constant $p_{i,j}$,
and that of $j$ is equal to $p_{j,i} = 1 - p_{i,j}$. We define the probability matrix $P = [p_{i,j}]$.


Almost all existing n-armed dueling bandit methods (Yue et al., 2012; Yue
and Joachims, 2011; Zoghi et al., 2013; Urvoy et al., 2013; Ailon et al., 2014)
focus on the explore-exploit problem and furthermore make a variety of assumptions
on the preference matrix $P$. In particular, those works assume the
existence of a Condorcet winner: an arm, $c$, such that $p_{c,j} > \frac{1}{2}$ for all $j \neq c$. The
Borda winner is an arm $b$ that satisfies $\sum_{j \neq b} p_{b,j} \geq \sum_{j \neq i} p_{i,j}$ for all $i = 1, \cdots, n$.

In other words, the Borda winner is the arm with the highest average probability 
of winning against other arms, or said another way, the arm that has the highest 
probability of winning against an arm selected uniformly at random from the 
remaining arms. The Condorcet winner has been given more attention than the
Borda winner, for two reasons: 1) Given a choice between the Borda and the Condorcet
winner, the latter is preferred in a direct comparison between the two.
2) As pointed out in Urvoy et al. (2013) and Zoghi et al. (2013), the Borda winner
can be found by reducing the dueling bandit problem to a standard multi-armed
bandit problem as follows.


Definition 1. Borda Reduction. The action of pulling arm $i$ with reward
$\frac{1}{n-1}\sum_{j \neq i} p_{i,j}$ can be simulated by dueling arm $i$ with another arm chosen
uniformly at random.
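
The reduction in Definition 1 can be sketched in a few lines of Python. This is only an illustration: the function names `borda_pull` and `borda_score` and the list-of-lists representation of $P$ are our assumptions, not part of the original presentation.

```python
import random

def borda_pull(P, i, rng):
    """Simulate one pull of arm i under the Borda reduction.

    P[i][j] is assumed to hold the probability that arm i beats arm j.
    Returns 1 if arm i wins a duel against a uniformly random other arm.
    """
    n = len(P)
    # The opponent is drawn uniformly from the other n - 1 arms.
    j = rng.choice([a for a in range(n) if a != i])
    return 1 if rng.random() < P[i][j] else 0

def borda_score(P, i):
    """Exact mean reward of borda_pull, i.e. the Borda score of arm i."""
    n = len(P)
    return sum(P[i][j] for j in range(n) if j != i) / (n - 1)
```

Averaging many calls to `borda_pull(P, i, rng)` gives an unbiased estimate of `borda_score(P, i)`, which is what lets any standard multi-armed bandit algorithm be run on top of duels.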


However, we feel that the Borda problem has received far less attention than
it deserves. Firstly, the Borda winner always exists, whereas the Condorcet winner does not.
For example, a Condorcet winner does not exist in the MSLR-WEB10k datasets
considered in this paper. Assuming the existence of a Condorcet winner severely
restricts the class of allowed $P$ matrices: only those $P$ matrices are allowed
which have a row with all entries $\geq \frac{1}{2}$. In fact, Yue et al. (2012) and Yue and
Joachims (2011) require that the comparison probabilities $p_{i,j}$ satisfy additional
transitivity conditions that are often violated in practice. Secondly, there are
many cases where the Borda winner and the Condorcet winner are distinct, and 
the Borda winner would be preferred in many cases. Let us assume that arm $c$ is
the Condorcet winner, with $p_{c,i} = 0.51$ for $i \neq c$. Let arm $b$ be the Borda winner
with $p_{b,i} = 1$ for $i \neq b, c$, and $p_{b,c} = 0.49$. Here arm $c$ is only
marginally better than the other arms, while arm $b$ is significantly preferred
over all other arms except against arm $c$, where it is marginally rejected. In
this example - chosen extreme to highlight the pervasiveness of situations where
the Borda arm is preferred - it is clear that arm $b$ should be the winner: think
of the arms as objects being contested, such as t-shirt designs, with
the $P$ matrix generated by showing users a pair of items and asking them
to choose the better of the two. This example also shows that the Borda
winner is more robust to estimation errors in the $P$ matrix (for instance, when
the $P$ matrix is estimated by asking a small sample of the entire population
to vote among pairwise choices). The Condorcet winner is sensitive to entries
in the Condorcet arm’s row that are close to $\frac{1}{2}$, which is not the case for the
Borda winner. Finally, there are important cases (explained next) where the
winner can be found with fewer duels than would be required by the Borda
reduction.
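
To make the example with arms $c$ and $b$ concrete, the following Python sketch builds one matrix consistent with it and computes both winners. The text does not specify the entries among the remaining arms; setting them to 1/2 is our assumption, as are the helper names.

```python
def borda_scores(P):
    """Borda score of every arm: average win probability against the others."""
    n = len(P)
    return [sum(P[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]

def condorcet_winner(P):
    """Return the arm beating every other arm with probability > 1/2, or None."""
    n = len(P)
    for c in range(n):
        if all(P[c][j] > 0.5 for j in range(n) if j != c):
            return c
    return None

def example_matrix(n):
    """Arm 0 plays the role of c, arm 1 the role of b.

    Entries among the remaining arms are set to 1/2 (an assumption).
    """
    P = [[0.5] * n for _ in range(n)]
    for i in range(1, n):
        P[0][i], P[i][0] = 0.51, 0.49  # c narrowly beats every other arm
    for i in range(2, n):
        P[1][i], P[i][1] = 1.0, 0.0    # b always beats the remaining arms
    return P

P = example_matrix(5)
scores = borda_scores(P)
borda_winner = max(range(len(P)), key=lambda i: scores[i])
```

With five arms, arm $c$ has Borda score 0.51 while arm $b$ has Borda score $(0.49 + 3)/4 = 0.8725$, so the two criteria select different arms.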


3 MOTIVATION 


We define the Borda score of an arm $i$ to be the probability of the $i$-th arm
winning a duel with another arm chosen uniformly at random:

$$s_i = \frac{1}{n-1} \sum_{j \neq i} p_{i,j} .$$


Without loss of generality, we assume that $s_1 > s_2 \geq \cdots \geq s_n$ but that this ordering
is unknown to the algorithm. As mentioned above, if the Borda reduction
is used then the dueling bandit problem becomes a regular multi-armed bandit
problem, and lower bounds for the multi-armed bandit problem (Kaufmann
et al., 2014; Mannor and Tsitsiklis, 2004) suggest that the number of samples
required should scale like $\Omega\left(\sum_{i \neq 1} \frac{1}{(s_1 - s_i)^2} \log\frac{1}{\delta}\right)$, which depends only on the
Borda scores, and not the individual entries of the preference matrix. This
would imply that any preference matrix $P$ with Borda scores $s_i$ is just as hard
as another matrix $P'$ with Borda scores $s'_i$ as long as $(s_1 - s_i) = (s'_1 - s'_i)$. Of
course, this lower bound only applies to algorithms using the Borda reduction,
and not any algorithm for identifying the Borda winner that may, for instance,
collect the duels in a more deliberate way. Next we consider specific $P$ matrices
that exhibit two very different kinds of structure but have the same differences
in Borda scores, which motivates the structure considered in this paper.
3.1 Preference Matrix P known up to permutation of indices


Shown below in equations (1) and (2) are two preference matrices $P_1$ and $P_2$
indexed by the number of arms $n$ that essentially have the same Borda gaps
- $(s_1 - s_i)$ is either like $\epsilon/n$ or approximately 1/4 - but we will argue that $P_1$
is much “easier” than $P_2$ in a certain sense (assume $\epsilon$ is an unknown constant,
like $\epsilon = 1/5$). Specifically, if given $P_1$ and $P_2$ up to a permutation of the labels
of their indices (i.e. given $\Lambda P_i \Lambda^T$ for some unknown permutation matrix $\Lambda$),
how many comparisons does it take to find the Borda winner in each case for
different values of $n$?

Recall from above that if we ignore the fact that we know the matrices up
to a permutation and use the Borda reduction technique, we can use a multi-armed
bandit algorithm (e.g. Karnin et al. (2013); Jamieson et al. (2014)) and
find the best arm for both $P_1$ and $P_2$ using $O(n^2 \log(\log(n)))$ samples. We
next argue that given $P_1$ and $P_2$ up to a permutation, there exists an algorithm
that can identify the Borda winner of $P_1$ with just $O(n \log(n))$ samples while
the identification of the Borda winner for $P_2$ requires at least $\Omega(n^2)$ samples.
This shows that given the probability matrices up to a permutation, the sample
complexity of identifying the Borda winner does not rely just on the Borda
differences, but on the particular structure of the probability matrix.

Consider $P_1$. We claim that there exists a procedure that exploits the structure
of the matrix to find the best arm of $P_1$ using just $O(n \log(n))$ samples.
Here’s how: For each arm, duel it with $32 \log\frac{n}{\delta}$ other arms chosen uniformly at
random. By Hoeffding’s inequality, with probability at least $1 - \delta$ our empirical
estimate of the Borda score will be within 1/8 of its true value for all $n$ arms
and we can remove the bottom $(n - 2)$ arms due to the fact that their Borda
gaps exceed 1/4. Having reduced the possible winners to just two arms, we can
identify which rows in the matrix they correspond to and duel each of these two
arms against all of the remaining $(n - 2)$ arms $O\left(\frac{1}{\epsilon^2}\right)$ times to find out which
one has the larger Borda score using just $O\left(\frac{2(n-2)}{\epsilon^2}\right)$ samples, giving an overall
sample complexity of $O(n \log n)$. We have improved the sample complexity
from $O(n^2 \log(\log(n)))$ using the Borda reduction to just $O(n \log(n))$.
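
The Hoeffding step in this argument can be worked out explicitly. The calculation below is a sketch under our own assumptions: $m$ denotes the number of duels per arm (a symbol we introduce), $\hat{s}_i$ is the empirical Borda score, and we read the per-arm budget as $m = 32 \log\frac{n}{\delta}$.

```latex
\Pr\left( |\hat{s}_i - s_i| \geq \tfrac{1}{8} \right)
  \leq 2 \exp\left( -2 m \left(\tfrac{1}{8}\right)^2 \right)
  = 2 \exp\left( -\tfrac{m}{32} \right)
  = \frac{2\delta}{n}
  \qquad \text{when } m = 32 \log\frac{n}{\delta} .
```

A union bound over the $n$ arms then bounds the failure probability by $2\delta$, so the guarantee holds up to a constant rescaling of $\delta$, and the first phase uses $nm = 32\, n \log\frac{n}{\delta} = O(n \log n)$ duels.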

Consider $P_2$. We claim that given this matrix up to a permutation of its
indices, no algorithm can determine the winner of $P_2$ without requesting $\Omega(n^2)$
samples. To see this, suppose an oracle has made the problem easier by reducing
the problem down to just the top two rows of the $P_2$ matrix. This is a
binary hypothesis test for which Fano’s inequality implies that to guarantee that
the probability of error is not above some constant level, the number of samples
to identify the Borda winner must scale like

$$\min_{j \in [n] \setminus \{1,2\}} \frac{1}{\mathrm{KL}(p_{1,j}, p_{2,j})} \geq c \min_{j \in [n] \setminus \{1,2\}} \frac{1}{(p_{1,j} - p_{2,j})^2} = \Omega(n^2),$$

where the inequality holds for some $c$ by Lemma 2 in the Appendix.

We just argued that the structure of the P matrix, and not just the Borda 
gaps, can dramatically influence the sample complexity of finding the Borda 
winner. This leads us to ask the question: if we don’t know anything about 
the P matrix beforehand (i.e. do not know the matrix up to a permutation of
its indices), can we learn and exploit this kind of structural information in an 
online fashion and improve over the Borda reduction scheme? The answer is 
no, as we argue next. 


3.2 Distribution-Dependent Lower Bound 

We prove a distribution-dependent lower bound on the complexity of finding 
the best Borda arm for a general P matrix. This is a result important in its 
own right as it shows that the lower bound obtained for an algorithm using the 
Borda reduction is tight, that is, this result implies that barring any structural 
assumptions, the Borda reduction is optimal. 

Definition 2. $\delta$-PAC dueling bandits algorithm: A $\delta$-PAC dueling bandits algorithm
is an algorithm that selects duels between arms and based on the outcomes
finds the Borda winner with probability greater than or equal to $1 - \delta$.


The techniques used to prove the following result are inspired by Lemma
1 in Kaufmann et al. (2014) and Theorem 1 in Mannor and Tsitsiklis (2004).


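Theorem 1. (Distribution-Dependent Lower Bound) Consider a matrix $P$ such
that $\frac{3}{8} \leq p_{i,j} \leq \frac{5}{8}$, $\forall i, j \in [n]$ with $n \geq 4$. Let $\tau$ be the total number of duels.
Then for $\delta \leq 0.15$, any $\delta$-PAC dueling bandits algorithm to find the Borda
winner has

$$\mathbb{E}_P[\tau] \geq C \log\left(\frac{1}{2\delta}\right) \sum_{i \neq 1} \frac{1}{(s_1 - s_i)^2}$$

where $s_i = \frac{1}{n-1} \sum_{j \neq i} p_{i,j}$ denotes the Borda score of arm $i$. Furthermore, $C$
can be chosen to be 1/90.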


The proof can be found in the supplementary material. 

In particular, this implies that for the preference matrix $P_1$ in (1), any algorithm
that makes no assumption about the structure of the $P$ matrix requires
$\Omega(n^2)$ samples. Next we argue that the particular structure found in $P_1$ is an
extreme case of a more general structural phenomenon found in real datasets
and that it is a natural structure to assume and design algorithms to exploit.
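
As a rough numerical illustration of the bound in Theorem 1, the sketch below evaluates $C \log\frac{1}{2\delta} \sum_{i \neq 1} (s_1 - s_i)^{-2}$ for a given list of Borda scores. The function name and the example scores are hypothetical, and the formula is used as we read it from the theorem statement.

```python
import math

def borda_lower_bound(scores, delta, C=1 / 90):
    """Evaluate C * log(1 / (2 * delta)) * sum_i 1 / (s_1 - s_i)^2.

    `scores` are Borda scores with a unique maximum (the Borda winner).
    """
    ordered = sorted(scores, reverse=True)
    best, rest = ordered[0], ordered[1:]
    return C * math.log(1 / (2 * delta)) * sum(1 / (best - s) ** 2 for s in rest)

# Hypothetical scores for four arms; gaps to the winner are 0.1, 0.1, 0.2.
bound = borda_lower_bound([0.6, 0.5, 0.5, 0.4], delta=0.1)
```

Halving every gap multiplies the bound by four, which is the inverse squared gap dependence described in the abstract.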


3.3 Motivation from Real-World Data 


The matrices $P_1$ and $P_2$ above illustrate a key structural aspect that can make
it easier to find the Borda winner. If the arms with the top Borda scores are
distinguished by duels with a small subset of the arms (as exemplified in $P_1$),
then finding the Borda winner may be easier than in the general case. Before 
formalizing a model for this sort of structure, let us look at two real-world 
datasets, which motivate the model. 

We consider the Microsoft Learning to Rank web search datasets MSLR-WEB10k
(Qin et al., 2010) and MQ2008-list (Qin and Liu, 2013) (see the
experimental section for descriptions). Each dataset is used to construct a
corresponding probability matrix $P$. We use these datasets to test the hypothesis
that comparisons with a small subset of the arms may suffice to determine
which of two arms has a greater Borda score.

Specifically, we will consider the Borda score of the best arm (arm 1) and
every other arm. For any other arm $i > 1$ and any positive integer $k \in [n - 2]$,
let $\Omega_{i,k}$ be a set of cardinality $k$ containing the indices $j \in [n] \setminus \{1, i\}$ with
the $k$ largest discrepancies $|p_{1,j} - p_{i,j}|$. These are the duels that, individually,
display the greatest differences between arm 1 and $i$. For each $k$, define
$\alpha_i(k) = 2(p_{1,i} - \frac{1}{2}) + \sum_{j \in \Omega_{i,k}} (p_{1,j} - p_{i,j})$. If the hypothesis holds, then the duels with
a small number of (appropriately chosen) arms should indicate that arm 1 is
better than arm $i$. In other words, $\alpha_i(k)$ should become and stay positive as soon
as $k$ reaches a relatively small value. Plots of these $\alpha_i$ curves for two datasets
are presented in Figure 1 and indicate that the Borda winner is apparent for
small $k$. This behavior is explained by the fact that the individual discrepancies
$|p_{1,j} - p_{i,j}|$ decay quickly when ordered from largest to smallest, as shown in
Figure 2.

The take away message is that it is unnecessary to estimate the difference 
or gap between the Borda scores of two arms. It suffices to compute the partial 
Borda gap based on duels with a small subset of the arms. An appropriately 
chosen subset of the duels will correctly indicate which arm has a larger Borda 
score. The algorithm proposed in the next section automatically exploits this 
structure. 
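
The partial Borda gap curve defined above is simple to compute from a known preference matrix. The following Python sketch is illustrative only: the function name `alpha_curve`, zero-based arm indices, and the small example matrix are our assumptions.

```python
def alpha_curve(P, i, best=0):
    """Return [alpha_i(1), ..., alpha_i(n-2)] for arm i against the best arm.

    alpha_i(k) = 2 * (p_{best,i} - 1/2) + the sum of (p_{best,j} - p_{i,j})
    over the k comparison arms j with the largest |p_{best,j} - p_{i,j}|.
    """
    n = len(P)
    others = [j for j in range(n) if j != best and j != i]
    # Order comparison arms by decreasing discrepancy between the two rows.
    others.sort(key=lambda j: abs(P[best][j] - P[i][j]), reverse=True)
    head = 2 * (P[best][i] - 0.5)
    curve, partial = [], 0.0
    for j in others:
        partial += P[best][j] - P[i][j]
        curve.append(head + partial)
    return curve

# Hypothetical 4-arm matrix in which arms 0 and 1 differ only against arm 3.
P = [
    [0.50, 0.50, 0.75, 0.95],
    [0.50, 0.50, 0.75, 0.75],
    [0.25, 0.25, 0.50, 0.50],
    [0.05, 0.25, 0.50, 0.50],
]
curve = alpha_curve(P, 1)
```

Here the curve is already positive at $k = 1$, and its last entry divided by $n - 1$ equals the full Borda gap between the two arms.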



Figure 1: Plots of $\alpha_i(k) = 2(p_{1,i} - \frac{1}{2}) + \sum_{j \in \Omega_{i,k}} (p_{1,j} - p_{i,j})$ vs. $k$ for 30 randomly
chosen arms (for visualization purposes); MSLR-WEB10k on left, MQ2008-list
on right. The curves are strictly positive after a small number of duels.


4 ALGORITHM AND ANALYSIS 


In this section we propose a new algorithm that exploits the kind of structure
just described above and prove a sample complexity bound. The algorithm is
inspired by the Successive Elimination (SE) algorithm of Even-Dar et al. (2006)
for standard multi-armed bandit problems. Essentially, the proposed algorithm
below implements SE with the Borda reduction and an additional elimination
criterion that exploits sparsity (condition 1 in the algorithm). We call the
algorithm Successive Elimination with Comparison Sparsity (SECS).

Figure 2: Plots of discrepancies $|p_{1,j} - p_{i,j}|$ in descending order for 30 randomly
chosen arms (for visualization purposes); MSLR-WEB10k on left, MQ2008-list
on right.

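We will use $\mathbf{1}_E$ to denote the indicator of the event $E$ and $[n] = \{1, 2, \ldots, n\}$.
The algorithm maintains an active set of arms $A_t$ such that if $j \notin A_t$ then the
algorithm has concluded that arm $j$ is not the Borda winner. At each time $t$,
the algorithm chooses an arm $I_t$ uniformly at random from $[n]$ and compares it
with all the arms in $A_t$. Note that $A_k \subseteq A_\ell$ for all $k \geq \ell$. Let $Z^{(t)}_{i,j} \in \{0, 1\}$ be
independent Bernoulli random variables with $\mathbb{E}[Z^{(t)}_{i,j}] = p_{i,j}$, each denoting the
outcome of “dueling” $i, j \in [n]$ at time $t$ (define $Z^{(t)}_{i,j} = 0$ for $i = j$). For any
$t \geq 1$, $i \in [n]$, and $j \in A_t$ define

$$\hat{p}_{j,i,t} = \frac{n}{t} \sum_{\ell=1}^{t} Z^{(\ell)}_{j,I_\ell} \mathbf{1}_{I_\ell = i}$$

so that $\mathbb{E}[\hat{p}_{j,i,t}] = p_{j,i}$. Furthermore, for any $t \geq 1$, $j \in A_t$ define

$$\hat{s}_{j,t} = \frac{n/(n-1)}{t} \sum_{\ell=1}^{t} Z^{(\ell)}_{j,I_\ell}$$

so that $\mathbb{E}[\hat{s}_{j,t}] = s_j$. For any $\Omega \subset [n]$ and $i, j \in [n]$ define

$$\Delta_{i,j}(\Omega) = 2(p_{i,j} - \tfrac{1}{2}) + \sum_{\omega \in \Omega : \omega \neq i, j} (p_{i,\omega} - p_{j,\omega}), \qquad \nabla_{i,j}(\Omega) = \sum_{\omega \in \Omega : \omega \neq i, j} |p_{i,\omega} - p_{j,\omega}|,$$

$$\hat{\Delta}_{i,j,t}(\Omega) = 2(\hat{p}_{i,j,t} - \tfrac{1}{2}) + \sum_{\omega \in \Omega : \omega \neq i, j} (\hat{p}_{i,\omega,t} - \hat{p}_{j,\omega,t}), \qquad \hat{\nabla}_{i,j,t}(\Omega) = \sum_{\omega \in \Omega : \omega \neq i, j} |\hat{p}_{i,\omega,t} - \hat{p}_{j,\omega,t}|.$$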
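Algorithm 1: Sparse Borda Algorithm

Input sparsity level $k \in [n - 2]$, time gate $T_0 \geq 0$
Start with active set $A_1 = \{1, 2, \cdots, n\}$, $t = 1$
Let $C_t = \sqrt{\frac{2 \log(4 n^2 t^2 / \delta)}{t/n}} + \frac{2 \log(4 n^2 t^2 / \delta)}{3 t / n}$
while $|A_t| > 1$ do
    Choose $I_t$ uniformly at random from $[n]$.
    for $j \in A_t$ do
        Observe $Z^{(t)}_{j,I_t}$ and update $\hat{p}_{j,I_t,t} = \frac{n}{t} \sum_{\ell=1}^{t} Z^{(\ell)}_{j,I_\ell} \mathbf{1}_{I_\ell = I_t}$, $\hat{s}_{j,t} = \frac{n/(n-1)}{t} \sum_{\ell=1}^{t} Z^{(\ell)}_{j,I_\ell}$.
    end
    $A_{t+1} = A_t \setminus \{ j \in A_t : \exists i \in A_t$ with
        1) $\mathbf{1}_{\{t > T_0\}} \hat{\Delta}_{i,j,t}\left(\arg\max_{\Omega \subset [n] : |\Omega| = k} \hat{\nabla}_{i,j,t}(\Omega)\right) > 6(k + 1) C_t$
        OR 2) $\hat{s}_{i,t} > \hat{s}_{j,t} + \frac{n}{n-1} \sqrt{\frac{2 \log(4 n t^2 / \delta)}{t}} \}$
    $t \leftarrow t + 1$
end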


The quantity $\Delta_{i,j}(\Omega)$ is the partial gap between the Borda scores for $i$ and $j$,
based on only the comparisons with the arms in $\Omega$. Note that $\frac{1}{n-1} \Delta_{i,j}([n]) = s_i - s_j$.
The quantity $\arg\max_{\Omega \subset [n] : |\Omega| = k} \nabla_{i,j}(\Omega)$ selects the indices $\omega$ yielding
the largest discrepancies $|p_{i,\omega} - p_{j,\omega}|$. $\hat{\Delta}$ and $\hat{\nabla}$ are empirical analogs of these
quantities.

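Definition 3. For any $i \in [n] \setminus 1$ we say the set $\{(p_{1,\omega} - p_{i,\omega})\}_{\omega \neq 1 \neq i}$ is $(\gamma, k)$-approximately
sparse if

$$\max_{\Omega \subset [n] : |\Omega| \leq k} \nabla_{1,i}(\Omega \setminus \Omega_i) \leq \gamma \Delta_{1,i}(\Omega_i)$$

where $\Omega_i = \arg\max_{\Omega \subset [n] : |\Omega| = k} \nabla_{1,i}(\Omega)$.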

Instead of the strong assumption that the set $\{(p_{1,\omega} - p_{i,\omega})\}_{\omega \neq 1 \neq i}$ has no
more than $k$ non-zero coefficients, the above definition relaxes this idea and just
assumes that the absolute value of the coefficients outside the largest $k$ are small
relative to the partial Borda gap. This definition is inspired by the structure
described in previous sections and will allow us to find the Borda winner faster.

The parameter $T_0$ is specified (see Theorem 2) to guarantee that all arms
with sufficiently large gaps $s_1 - s_i$ are eliminated by time step $T_0$ (condition 2).
Once $t > T_0$, condition 1 also becomes active and the algorithm starts removing
arms with large partial Borda gaps, exploiting the assumption that the top
arms can be distinguished by comparisons with a sparse set of other arms. The
algorithm terminates when only one arm remains.
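
The following Python sketch shows how the two elimination conditions fit together in a simulation where duel outcomes are drawn from a known matrix. It is a minimal illustration, not the authors' implementation: the function name `secs`, the arguments `const` and `max_steps`, the zero-based arm indices, and the confidence widths (written as we read them from Algorithm 1) are all assumptions of this sketch.

```python
import math
import random

def secs(P, k, T0, delta, rng, const=6.0, max_steps=200_000):
    """Simulate SECS on a known matrix P (P[i][j] = probability i beats j).

    Returns (winner, duels); winner is None if no single arm survives.
    """
    n = len(P)
    active = set(range(n))
    wins = [[0] * n for _ in range(n)]  # wins[j][i]: duels j won against comparison arm i
    total = [0] * n                     # total[j]: all duels won by arm j
    duels = 0

    def p_hat(a, b, t):                 # empirical estimate of p_{a,b}
        return n * wins[a][b] / t

    def s_hat(a, t):                    # empirical Borda score of arm a
        return n / (n - 1) * total[a] / t

    for t in range(1, max_steps + 1):
        if len(active) <= 1:
            break
        comp = rng.randrange(n)         # comparison arm I_t, uniform on [n]
        for j in active:
            if j == comp:
                continue                # the outcome is defined as 0 when i = j
            duels += 1
            if rng.random() < P[j][comp]:
                wins[j][comp] += 1
                total[j] += 1
        log_term = 2 * math.log(4 * n * n * t * t / delta)
        c_t = math.sqrt(log_term / (t / n)) + log_term / (3 * t / n)
        s_width = n / (n - 1) * math.sqrt(2 * math.log(4 * n * t * t / delta) / t)
        removed = set()
        for j in active:
            for i in active:
                if i == j:
                    continue
                # Condition 2: elimination on the full empirical Borda scores.
                if s_hat(i, t) > s_hat(j, t) + s_width:
                    removed.add(j)
                    break
                # Condition 1: partial gap over the k largest discrepancies.
                if t > T0:
                    cands = [w for w in range(n) if w != i and w != j]
                    cands.sort(key=lambda w: abs(p_hat(i, w, t) - p_hat(j, w, t)),
                               reverse=True)
                    partial = 2 * (p_hat(i, j, t) - 0.5) + sum(
                        p_hat(i, w, t) - p_hat(j, w, t) for w in cands[:k])
                    if partial > const * (k + 1) * c_t:
                        removed.add(j)
                        break
        active -= removed
    winner = next(iter(active)) if len(active) == 1 else None
    return winner, duels
```

Calling `secs(P, k=1, T0=0, delta=0.1, rng=random.Random(0))` runs the sketch with condition 1 active from the first step. Passing `T0=float("inf")` disables condition 1, leaving Successive Elimination on top of the Borda reduction.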


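Theorem 2. Let $k \geq 0$ and $T_0 > 0$ be inputs to the above algorithm and let $R$
be the solution to $\frac{32}{R^2} \log\left(\frac{32 n / \delta}{R^2}\right) = T_0$. If for all $i \in [n] \setminus 1$, at least one of the
following holds:

1. $\{(p_{1,\omega} - p_{i,\omega})\}_{\omega \neq 1 \neq i}$ is $(\frac{1}{3}, k)$-approximately sparse,

2. $(s_1 - s_i) \geq R$,

then with probability at least $1 - 3\delta$, the algorithm returns the best arm after no
more than

$$c \sum_{i > 1} \min\left\{ \max\left\{ \frac{1}{R^2} \log\left(\frac{n/\delta}{R^2}\right), \frac{(k+1)^2/n}{\Delta_i^2} \log\left(\frac{n/\delta}{\Delta_i^2}\right) \right\}, \frac{1}{\Delta_i^2} \log\left(\frac{n/\delta}{\Delta_i^2}\right) \right\}$$

samples where $\Delta_i := s_1 - s_i$ and $c > 0$ is an absolute constant.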


The second argument of the min is precisely the result one would obtain
by running Successive Elimination with the Borda reduction (Even-Dar et al.,
2006). Thus, under the stated assumptions, the algorithm never does worse
than the Borda reduction scheme. The first argument of the min indicates the
potential improvement gained by exploiting the sparsity assumption. The first
argument of the max is the result of throwing out the arms with large Borda
differences and the second argument is the result of throwing out arms where a
partial Borda difference was observed to be large.

To illustrate the potential improvements, consider the $P_1$ matrix discussed
above. The theorem implies that by choosing $T_0$ so that $R$ is on the order of the
large Borda gaps (approximately 1/4) and setting $k = 1$, we obtain a sample
complexity of $O(\epsilon^{-2} n \log(n))$
for the proposed algorithm compared to the standard Borda reduction sample
complexity of $\Omega(n^2)$.

In practice it is difficult to optimize the choice of $T_0$ and $k$, but motivated by
the results shown in the experiments section, we recommend setting $T_0 = 0$ and
$k = 5$ for typical problems.


5 EXPERIMENTS 


The goal of this section is not to obtain the best possible sample complexity 
results for the specified datasets, but to show the relative performance gain of 
exploiting structure using the proposed SECS algorithm with respect to the 
Borda reduction. That is, we just want to measure the effect of exploiting 
sparsity while keeping all other parts of the algorithms constant. Thus, the 
algorithm we compare to that uses the simple Borda reduction is simply the 
SECS algorithm described above but with $T_0 = \infty$ so that the sparse condition
never becomes activated. Run in this way, the algorithm is very closely
related to the Successive Elimination algorithm of Even-Dar et al. (2006). In
what follows, our proposed algorithm will be called SECS and the benchmark
algorithm will be denoted as just the Borda reduction (BR) algorithm.

Figure 3: Comparison of the Borda reduction algorithm and the proposed SECS
algorithm run on the $P_1$ matrix for different values of $n$ (horizontal axis: number of arms $n$). Plot is on log-log scale
so that the sample complexity grows like $n^s$ where $s$ is the slope of the line.

We experiment on both simulated data and two real-world datasets. During
all experiments, both the BR and SECS algorithms were run with $\delta = 0.1$. For
the SECS algorithm we set $T_0 = 0$ to enable condition 1 from the very beginning
(recall for BR we set $T_0 = \infty$). Also, while the algorithm has a constant factor
of 6 multiplying $(k + 1) C_t$, we feel that the analysis that led to this constant
is very loose, so in practice we recommend the use of a constant of 1/2, which
was used in our experiments. While the change of this constant invalidates the
guarantee of Theorem 2, we note that in all of the experiments to be presented
here, neither algorithm ever failed to return the best arm. This observation also
suggests that the SECS algorithm is robust to possible inconsistencies of the
model assumptions.

5.1 Synthetic Preference matrix 

Both algorithms were tasked with finding the best arm using the $P_1$ matrix of
(1) with $\epsilon = 1/5$ for problem sizes equal to $n = 10, 20, 30, 40, 50, 60, 70, 80$ arms.
Inspecting the $P_1$ matrix, we see that a value of $k = 1$ in the SECS algorithm
suffices so this is used for all problem sizes. The entries of the preference matrix
$p_{i,j}$ are used to simulate comparisons between the respective arms and each
experiment was repeated 75 times.

Recall from Section 3 that any algorithm using the Borda reduction on the
$P_1$ matrix has a sample complexity of $\Omega(n^2)$. Moreover, inspecting the proof
of Theorem 2 one concludes that the BR algorithm has a sample complexity of
$O(n^2 \log(n))$ for the $P_1$ matrix. On the other hand, Theorem 2 states that the
SECS algorithm should have a sample complexity no worse than $O(n \log(n))$
for the $P_1$ matrix. Figure 3 plots the sample complexities of SECS and BR on
a log-log plot. On this scale, to match our sample complexity hypotheses, the
slope of the BR line should be about 2 while the slope of the SECS line should
be about 1, which is exactly what we observe.
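
The slope reading used here can be reproduced with an ordinary least-squares fit in log-log coordinates. The Python sketch below uses synthetic, hypothetical sample counts that follow exact power laws. It does not use the measured values from the experiments, and the function name is ours.

```python
import math

def loglog_slope(ns, samples):
    """Least-squares slope of log(samples) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(s) for s in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

ns = [10, 20, 30, 40, 50, 60, 70, 80]
quadratic = [3.0 * n ** 2 for n in ns]  # hypothetical BR-like growth
linear = [5.0 * n for n in ns]          # hypothetical SECS-like growth
slope_quadratic = loglog_slope(ns, quadratic)
slope_linear = loglog_slope(ns, linear)
```

For exact power laws the fitted slopes are 2 and 1. Logarithmic factors such as those in $n \log n$ push the fitted slope slightly above the exponent over a finite range of $n$.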


5.2 Web search data 


We consider two web search data sets. The first is the MSLR-WEB10k Microsoft
Learning to Rank data set (Qin et al., 2010) that is characterized by
approximately 30,000 search queries over a number of documents from search
results. The data also contains the values of 136 features and corresponding user
labelled relevance factors with respect to each query-document pair. We use the
training set of Fold 1, which comprises about 2,000 queries. The second data
set is the MQ2008-list from the Microsoft Learning to Rank 4.0 (MQ2008) data
set (Qin and Liu, 2013). We use the training set of Fold 1, which has about 550
queries. Each query has a list of documents with 46 features and corresponding
user labelled relevance factors.

For each data set, we create a set of rankers, each corresponding to a feature
from the feature list. The aim of this task is to determine the feature whose
ranking of query-document pairs is the most relevant. To compare two rankers,
we randomly choose a pair of documents and compare their relevance rankings
with those of the features. Whenever a mismatch occurs between the rankings
returned by the two features, the feature whose ranking matches that of the
relevance factors of the two documents “wins the duel”. If both features rank
the documents similarly, the duel is deemed to have resulted in a tie and we
flip a fair coin. We run a Monte Carlo simulation on both data sets to obtain
a preference matrix $P$ corresponding to their respective feature sets. As with
the previous setup, the entries of the preference matrices ($[P]_{i,j} = p_{i,j}$) are used
to simulate comparisons between the respective arms and each experiment was
repeated 75 times.
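
A single duel between two feature-based rankers, as described above, might be simulated as in the following Python sketch. The data layout (one feature value and one relevance label per document) and the function names are hypothetical. Treating the case where the rankers disagree but neither matches the relevance labels as a tie is our assumption, since the text does not cover it.

```python
import random

def order(a, b):
    """Sign of a - b: 1, 0, or -1."""
    return (a > b) - (a < b)

def duel(feat_a, feat_b, relevance, d1, d2, rng):
    """Return True if ranker A wins the duel on documents d1 and d2."""
    rank_a = order(feat_a[d1], feat_a[d2])
    rank_b = order(feat_b[d1], feat_b[d2])
    truth = order(relevance[d1], relevance[d2])
    if rank_a != rank_b:
        # Mismatch: the ranker agreeing with the relevance labels wins.
        if rank_a == truth:
            return True
        if rank_b == truth:
            return False
    # Rankers agree (or neither matches the labels): flip a fair coin.
    return rng.random() < 0.5
```

Averaging `duel` over many random document pairs gives a Monte Carlo estimate of one entry $p_{i,j}$ of the preference matrix.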

From the MSLR-WEB10k data set, a single arm was removed for our experiments
as its Borda score was unreasonably close to the arm with the best
Borda score and behaved unlike any other arm in the dataset with respect to its
$\alpha_i$ curves, confounding our model. For these real datasets, we consider a range
of different $k$ values for the SECS algorithm. As noted above, while there is
no guarantee that the SECS algorithm will return the true Borda winner, in all
of our trials for all values of $k$ reported we never observed a single error. This
is remarkable as it shows that the correctness of the algorithm is insensitive to
the value of $k$ on at least these two real datasets. The sample complexities of
BR and SECS on both datasets are reported in Figure 4. We observe that the
SECS algorithm, for small values of $k$, can identify the Borda winner using as
few as half the number of samples required using the Borda reduction method. As $k$ grows,
the performance of the SECS algorithm becomes that of the BR algorithm, as
predicted by Theorem 2.

Lastly, the preference matrices of the two data sets support the argument
for finding the Borda winner over the Condorcet winner. The MSLR-WEB10k
data set has no Condorcet winner arm. However, while the MQ2008 data set
has a Condorcet winner, when we consider the Borda scores of the arms, it ranks
second.

Figure 4: Comparison of an action elimination-style algorithm using the Borda
reduction (denoted as BR) and the proposed SECS algorithm with different
values of $k$ on the two datasets.
A Proof of Lower Bound


We begin by stating a few technical lemmas. At the heart of the proof of the
lower bound is Lemma 1 of Kaufmann et al. (2014), restated here for completeness.


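Lemma 1. Let $\nu$ and $\nu'$ be two bandit models defined over $n$ arms. Let $\sigma$
be a stopping time with respect to $(\mathcal{F}_t)$ and let $A \in \mathcal{F}_\sigma$ be an event such that
$0 < \mathbb{P}_\nu(A) < 1$. Then

$$\sum_{a=1}^{n} \mathbb{E}_\nu[N_a(\sigma)]\, \mathrm{KL}(\nu_a, \nu'_a) \geq d(\mathbb{P}_\nu(A), \mathbb{P}_{\nu'}(A))$$

where $d(x, y) = x \log(x/y) + (1 - x) \log((1 - x)/(1 - y))$.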

Note that the function d is exactly the KL-divergence between two Bernoulli 
distributions. 


Corollary 1. Let $N_{i,j} = N_{j,i}$ denote the number of duels between arms $i$ and $j$.
For the dueling bandits problem with $n$ arms, we have $\frac{n(n-1)}{2}$ free parameters
(or arms). These are the numbers in the upper triangle of the $P$ matrix. Then,
if $P'$ is an alternate matrix, we have from Lemma 1

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \mathbb{E}_P[N_{i,j}]\, d(p_{i,j}, p'_{i,j}) \geq d(\mathbb{P}_P(A), \mathbb{P}_{P'}(A))$$


The above corollary relates the cumulative number of duels of a subset of
arms to the uncertainty between the actual distribution and an alternative distribution.
In deference to interpretability rather than preciseness, we will use
the following bound of the KL divergence.

Lemma 2. (Upper bound on KL Divergence for Bernoullis) Consider two Bernoulli
random variables with means $p$ and $q$, $0 < p, q < 1$. Then

$$d(p, q) \leq \frac{(p - q)^2}{q(1 - q)}$$


Proof. 

d(p,q) =p\og- + (1 -p)log -—- < p-—- + (1 -p )^—- = ' 

q 1- q q 1 - q q{l - q) 

where we use the fact that logx <x—lfora::>0. □ 
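
As a quick numerical sanity check of Lemma 2 (not part of the original proof), the following Python sketch evaluates both sides of the bound on a grid of means. The function names and the grid resolution are illustrative choices, not notation from the paper.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def chi_square_bound(p: float, q: float) -> float:
    """Right-hand side of Lemma 2: (p - q)^2 / (q (1 - q))."""
    return (p - q) ** 2 / (q * (1 - q))


def bound_holds_on_grid(steps: int = 99) -> bool:
    """Check d(p, q) <= (p - q)^2 / (q (1 - q)) on an evenly spaced grid in (0, 1)."""
    grid = [k / (steps + 1) for k in range(1, steps + 1)]
    # A tiny tolerance absorbs floating-point rounding when p == q.
    return all(
        bernoulli_kl(p, q) <= chi_square_bound(p, q) + 1e-12
        for p in grid
        for q in grid
    )
```

A grid check of this kind cannot replace the proof, but it is a cheap way to catch a transcription error in the statement of the bound.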


We are now in a position to restate and prove the lower bound theorem. 

Theorem 3. (Lower bound on sample complexity of finding Borda winner
for the Dueling Bandits Problem) Consider a matrix $P$ such that
$\frac{3}{8} \le p_{i,j} \le \frac{5}{8}$, $\forall i,j \in [n]$, and $n > 3$.
Then for $\delta \le 0.15$, any $\delta$-PAC dueling bandits algorithm to find
the Borda winner has

$$\mathbb{E}_P[\tau] \ge C \sum_{i \ne 1} \frac{1}{(s_1 - s_i)^2} \log\frac{1}{2\delta},$$

where $s_i = \frac{1}{n-1} \sum_{j \ne i} p_{i,j}$ denotes the Borda score of
arm $i$. $C$ can be chosen to be

40 ■ 
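
To make the quantity in the theorem concrete, the sketch below computes Borda scores and the gap-dependent term of the bound for a preference matrix. It assumes the normalized Borda score (the average of the off-diagonal entries of a row) and a unique Borda winner. Arms are 0-indexed, the helper names are hypothetical, and the constant $C$ is deliberately left out.

```python
import math


def borda_scores(P):
    """Borda score of each arm: average of its off-diagonal win probabilities."""
    n = len(P)
    return [sum(P[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]


def inverse_squared_gap_sum(P):
    """Sum over suboptimal arms of 1 / (s_best - s_i)^2 (assumes a unique winner)."""
    s = borda_scores(P)
    best = max(range(len(s)), key=s.__getitem__)
    return sum(1.0 / (s[best] - s[i]) ** 2 for i in range(len(s)) if i != best)


def lower_bound_rate(P, delta):
    """Gap sum times log(1 / (2 delta)); the theorem's bound up to the constant C."""
    return inverse_squared_gap_sum(P) * math.log(1.0 / (2.0 * delta))


# Hypothetical 4-arm instance; arm 0 is the Borda winner.
EXAMPLE_P = [
    [0.50, 0.60, 0.60, 0.60],
    [0.40, 0.50, 0.55, 0.55],
    [0.40, 0.45, 0.50, 0.50],
    [0.40, 0.45, 0.50, 0.50],
]
```

For `EXAMPLE_P` the Borda scores are roughly 0.6, 0.5, 0.45 and 0.45, so the two arms with the smaller gap of 0.1 and 0.15 dominate the sum.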

Proof. Consider an alternate hypothesis $P'$ where arm $b$ is the best arm, and
such that $P'$ differs from $P$ only in the indices $\{bj : j \notin \{1,b\}\}$.
Note that the Borda score of arm 1 is unaffected in the alternate hypothesis.
Corollary 1 then gives us:

$$\sum_{j \in [n] \setminus \{1,b\}} \mathbb{E}_P[N_{b,j}] \, d(p_{b,j}, p'_{b,j}) \ge d(\mathbb{P}_P(A), \mathbb{P}_{P'}(A)). \qquad (3)$$

Let $A$ be the event that the algorithm selects arm 1 as the best arm. Since we
assume a $\delta$-PAC algorithm, $\mathbb{P}_P(A) \ge 1 - \delta$ and
$\mathbb{P}_{P'}(A) \le \delta$. It can be shown that for $\delta \le 0.15$,
$d(\mathbb{P}_P(A), \mathbb{P}_{P'}(A)) \ge \log\frac{1}{2\delta}$.
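
The bound on $d$ in terms of $\log\frac{1}{2\delta}$ is stated without proof. A minimal numerical check, assuming the extreme case $\mathbb{P}_P(A) = 1 - \delta$ and $\mathbb{P}_{P'}(A) = \delta$, is sketched below; the helper names are illustrative.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def pac_bound_holds(delta: float) -> bool:
    """Check d(1 - delta, delta) >= log(1 / (2 delta)) for a single delta."""
    return bernoulli_kl(1 - delta, delta) >= math.log(1 / (2 * delta))


def holds_up_to(delta_max: float = 0.15, steps: int = 150) -> bool:
    """Check the inequality on an evenly spaced grid in (0, delta_max]."""
    return all(pac_bound_holds(delta_max * k / steps) for k in range(1, steps + 1))
```

On this grid the inequality holds throughout $(0, 0.15]$ and already fails at $\delta = 0.16$, which is consistent with the threshold used in the theorem.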
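Define $N_b = \sum_{j \ne b} N_{b,j}$. Consider

$$\begin{aligned}
\Big( \max_{j \notin \{1,b\}} d(p_{b,j}, p'_{b,j}) \Big) \mathbb{E}_P[N_b]
&= \Big( \max_{j \notin \{1,b\}} d(p_{b,j}, p'_{b,j}) \Big) \sum_{j \ne b} \mathbb{E}_P[N_{b,j}] \\
&\ge \Big( \max_{j \notin \{1,b\}} d(p_{b,j}, p'_{b,j}) \Big) \sum_{j \in [n] \setminus \{1,b\}} \mathbb{E}_P[N_{b,j}] \\
&\ge \sum_{j \in [n] \setminus \{1,b\}} \mathbb{E}_P[N_{b,j}] \, d(p_{b,j}, p'_{b,j}) \\
&\ge \log\frac{1}{2\delta} \quad \text{(by (3))}. \qquad (4)
\end{aligned}$$
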


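In particular, choose $p'_{b,j} = p_{b,j} + \frac{n-1}{n-2}(s_1 - s_b) + \varepsilon$,
$j \notin \{1,b\}$. As required, under hypothesis $P'$, arm $b$ is the best arm:
its Borda score becomes $s_1 + \frac{n-2}{n-1}\varepsilon > s_1$, while no other
score increases.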
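
The construction of the alternate hypothesis can be sketched in a few lines. The code below is an illustration under stated assumptions: arm 1 of the text is index 0, the matrix satisfies $p_{j,b} = 1 - p_{b,j}$ (so the mirrored entries are updated as well), and the function names and example matrix are hypothetical.

```python
def borda_scores(P):
    """Borda score of each arm: average of its off-diagonal win probabilities."""
    n = len(P)
    return [sum(P[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]


def alternative_matrix(P, b, eps):
    """Shift p_{b,j} for j not in {0, b} so that arm b overtakes arm 0."""
    n = len(P)
    s = borda_scores(P)
    shift = (n - 1) / (n - 2) * (s[0] - s[b]) + eps
    Q = [row[:] for row in P]  # copy, leave P untouched
    for j in range(n):
        if j not in (0, b):
            Q[b][j] = P[b][j] + shift
            Q[j][b] = 1.0 - Q[b][j]  # keep the matrix consistent
    return Q


# Hypothetical 4-arm instance; arm 0 is the Borda winner under P.
EXAMPLE_P = [
    [0.50, 0.60, 0.60, 0.60],
    [0.40, 0.50, 0.55, 0.55],
    [0.40, 0.45, 0.50, 0.50],
    [0.40, 0.45, 0.50, 0.50],
]
```

With `b = 1` and `eps = 0.01` the shifted entries equal 0.71, arm 0 keeps its Borda score of 0.6, and arm 1 moves to roughly 0.607, so it becomes the Borda winner by a margin proportional to `eps`.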

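By the bounds assumed on the entries of $P$, $p_{b,j}$ and $s_1$ are bounded
above and $s_b$ is bounded below, so $p'_{b,j}$ stays bounded away from 1 as
$\varepsilon \searrow 0$. This implies $\frac{1}{p'_{b,j}(1 - p'_{b,j})} \le 20$.
Combined with Lemma 2, (4) implies

$$20 \Big[ \frac{n-1}{n-2}(s_1 - s_b) + \varepsilon \Big]^2 \mathbb{E}_P[N_b] \ge \log\frac{1}{2\delta}
\quad \Longrightarrow \quad
\mathbb{E}_P[N_b] \ge \frac{1}{20} \Big( \frac{n-2}{n-1} \Big)^2 \frac{1}{(s_1 - s_b)^2} \log\frac{1}{2\delta}, \qquad (5)$$

where we let $\varepsilon \searrow 0$.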


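Finally, iterating over all arms $b \ne 1$, we have

$$\mathbb{E}_P[\tau] = \frac{1}{2} \sum_{b=1}^{n} \sum_{j \ne b} \mathbb{E}_P[N_{b,j}]
= \frac{1}{2} \sum_{b=1}^{n} \mathbb{E}_P[N_b]
\ge \frac{1}{2} \sum_{b=2}^{n} \mathbb{E}_P[N_b]
\ge \frac{1}{40} \Big( \frac{n-2}{n-1} \Big)^2 \sum_{b \ne 1} \frac{1}{(s_1 - s_b)^2} \log\frac{1}{2\delta}. \qquad \square$$
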


B Proof of Upper Bound 

To prove the theorem we first need a technical lemma. 

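Lemma 3. For all $s \in \mathbb{N}$, let $I_s$ be drawn independently and
uniformly at random from $[n]$ and let $Z_{i,j}^{(s)}$ be a Bernoulli random
variable with mean $p_{i,j}$. If

$$\widehat{p}_{i,j,t} = \frac{n}{t} \sum_{s=1}^{t} Z_{i,j}^{(s)} \mathbf{1}_{I_s = j}
\quad \text{and} \quad
C_t = \sqrt{\frac{2 \log(4n^2t^2/\delta)}{t/n}} + \frac{2 \log(4n^2t^2/\delta)}{3t/n},$$

then $\mathbb{P}\Big( \bigcup_{(i,j) \in [n]^2 : i \ne j} \bigcup_{t=1}^{\infty} \{ |\widehat{p}_{i,j,t} - p_{i,j}| > C_t \} \Big) \le \delta$.

Proof. Note that $t\,\widehat{p}_{i,j,t} = \sum_{s=1}^{t} n Z_{i,j}^{(s)} \mathbf{1}_{I_s = j}$
is a sum of i.i.d. random variables taking values in $[0,n]$ with
$\mathbb{E}\big[ (n Z_{i,j}^{(s)} \mathbf{1}_{I_s = j})^2 \big] \le n^2 \, \mathbb{E}[\mathbf{1}_{I_s = j}] \le n$.
A direct application of Bernstein's inequality (Boucheron et al. 2013) and
union bounding over all pairs $(i,j) \in [n]^2$ and time $t$ gives the result. □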
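
The estimator in Lemma 3 can be simulated directly. The sketch below is illustrative only: it assumes the opponent is drawn uniformly from all $n$ arms (with a diagonal entry of 0.5), reads the confidence width as a Bernstein-type bound with the two terms $\sqrt{2 \log(4n^2t^2/\delta)/(t/n)}$ and $2 \log(4n^2t^2/\delta)/(3t/n)$, and uses hypothetical function names.

```python
import math
import random


def confidence_width(n: int, t: int, delta: float) -> float:
    """Bernstein-type width C_t, as read from Lemma 3 (an assumption of this sketch)."""
    log_term = math.log(4 * n ** 2 * t ** 2 / delta)
    return math.sqrt(2 * log_term / (t / n)) + 2 * log_term / (3 * t / n)


def estimate_row(P, i, t, rng):
    """Importance-weighted estimates of p_{i,j} from t duels with random opponents."""
    n = len(P)
    wins = [0] * n
    for _ in range(t):
        j = rng.randrange(n)        # opponent I_s, uniform over all n arms
        if rng.random() < P[i][j]:  # duel outcome Z ~ Bernoulli(p_{i,j})
            wins[j] += 1
    # Each opponent is seen about t / n times, hence the factor n / t.
    return [n * w / t for w in wins]


# Hypothetical 4-arm instance.
EXAMPLE_P = [
    [0.50, 0.60, 0.60, 0.60],
    [0.40, 0.50, 0.55, 0.55],
    [0.40, 0.45, 0.50, 0.50],
    [0.40, 0.45, 0.50, 0.50],
]
```

For example, `estimate_row(EXAMPLE_P, 0, 20000, random.Random(0))` returns one estimate per opponent, and `confidence_width(4, 20000, 0.1)` gives the corresponding width. The factor $n$ in the width reflects that only a $1/n$ fraction of the duels involves any fixed opponent.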

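A consequence of the lemma is that by repeated application of the triangle
inequality,

$$|\widehat{\nabla}_{i,j,t}(\Omega) - \nabla_{i,j}(\Omega)|
\le \sum_{\omega \in \Omega} \big( |\widehat{p}_{i,\omega,t} - p_{i,\omega}| + |p_{j,\omega} - \widehat{p}_{j,\omega,t}| \big)
\le 2|\Omega| C_t$$

and similarly $|\widehat{\Delta}_{i,j,t}(\Omega) - \Delta_{i,j}(\Omega)| \le 2(1 + |\Omega|) C_t$
for all $i,j \in [n]$ with $i \ne j$, all $t \in \mathbb{N}$ and all
$\Omega \subset [n]$. We are now ready to prove the theorem.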


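Proof. We begin the proof by defining $C_t(\Omega) = 2(1 + |\Omega|) C_t$ and
considering the events

$$\bigcap_{t=1}^{\infty} \bigcap_{\Omega \subset [n]} \big\{ |\widehat{\Delta}_{i,j,t}(\Omega) - \Delta_{i,j}(\Omega)| < C_t(\Omega) \big\},$$

$$\bigcap_{t=1}^{\infty} \bigcap_{\Omega \subset [n]} \big\{ |\widehat{\nabla}_{i,j,t}(\Omega) - \nabla_{i,j}(\Omega)| < C_t(\Omega) \big\},$$

$$\bigcap_{t=1}^{\infty} \bigcap_{i=1}^{n} \Big\{ |\widehat{s}_{i,t} - s_i| < \frac{n}{n-1} \sqrt{\frac{\log(4nt^2/\delta)}{2t}} \Big\},$$

that each hold with probability at least $1 - \delta$. The first two sets of
events are a consequence of Lemma 3 and the last set of events is proved using
a straightforward Hoeffding bound (Boucheron et al. 2013) and a union bound
similar to that in Lemma 3. In what follows assume these events hold.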

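Step 1: If $t \ge T_0$ and $s_1 - s_j \ge R$ then $j \notin A_t$.

We begin by considering all those $j \in [n] \setminus 1$ such that
$s_1 - s_j \ge R$ and show that with the prescribed value of $T_0$, these arms
are thrown out before $t > T_0$. By the events defined above, for arbitrary
$i \in [n] \setminus 1$ we have

$$\widehat{s}_{i,t} - \widehat{s}_{1,t}
= \widehat{s}_{i,t} - s_i + s_1 - \widehat{s}_{1,t} + s_i - s_1
\le s_i - s_1 + \frac{2n}{n-1} \sqrt{\frac{\log(4nt^2/\delta)}{2t}}
\le \frac{2n}{n-1} \sqrt{\frac{\log(4nt^2/\delta)}{2t}}$$

since by definition $s_1 \ge s_i$. This proves that the best arm will never be
thrown out using the Borda reduction, which implies that $1 \in A_t$ for all
$t \le T_0$. On the other hand, for any $j \in [n] \setminus 1$ such that
$s_1 - s_j \ge R$ and $t \le T_0$ we have

$$\max_{i \in A_t} \widehat{s}_{i,t} - \widehat{s}_{j,t}
\ge \widehat{s}_{1,t} - \widehat{s}_{j,t}
\ge s_1 - s_j - \frac{2n}{n-1} \sqrt{\frac{\log(4nt^2/\delta)}{2t}}
= \frac{\Delta_{1,j}([n])}{n-1} - \frac{2n}{n-1} \sqrt{\frac{\log(4nt^2/\delta)}{2t}}.$$

If $T_j$ is the first time $t$ that the right hand side of the above is greater
than or equal to $\frac{2n}{n-1} \sqrt{\frac{\log(4nt^2/\delta)}{2t}}$ then

$$T_j \le \frac{32n^2}{\Delta_{1,j}^2([n])} \log\Big( \frac{32n^2 \sqrt{n/\delta}}{\Delta_{1,j}^2([n])} \Big)$$

since for all positive $a, b, t$ with $a/b \ge e$ we have
$t \ge \frac{2\log(a/b)}{b} \implies b \ge \frac{\log(at)}{t}$.

Thus, any $j$ with $\frac{\Delta_{1,j}([n])}{n-1} = s_1 - s_j \ge R$ has
$T_j \le T_0$, which implies that any $i \in A_t$ for $t > T_0$ has
$s_1 - s_i \le R$.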
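
The implication used to bound $T_j$ is easy to probe numerically. The sketch below reads it as "$t \ge 2\log(a/b)/b$ implies $b \ge \log(at)/t$ whenever $a/b \ge e$" and instantiates it with $a = 2\sqrt{n/\delta}$ and $b = \Delta^2/(16n^2)$, which is one way to arrive at a bound with leading constant $32n^2/\Delta^2$. Both the reading and the function names are assumptions of this sketch.

```python
import math


def sufficient_time(a: float, b: float) -> float:
    """Smallest t covered by the implication: t = 2 log(a / b) / b."""
    return 2 * math.log(a / b) / b


def rate_condition_holds(a: float, b: float, t: float) -> bool:
    """The conclusion of the implication: b >= log(a t) / t."""
    return b >= math.log(a * t) / t


def implication_holds_on_grid() -> bool:
    """Check the implication for several (a, b) with a / b >= e and t >= sufficient_time."""
    for a in (1.0, 10.0, 1e3):
        for ratio in (math.e, 10.0, 1e4):
            b = a / ratio
            t0 = sufficient_time(a, b)
            for mult in (1.0, 2.0, 50.0):
                if not rate_condition_holds(a, b, mult * t0):
                    return False
    return True


def step1_time_bound(n: int, delta: float, gap: float) -> float:
    """Bound on T_j with a = 2 sqrt(n / delta), b = gap^2 / (16 n^2); gap plays the role of Delta_{1,j}([n])."""
    a = 2 * math.sqrt(n / delta)
    b = gap ** 2 / (16 * n ** 2)
    return sufficient_time(a, b)
```

With these choices `step1_time_bound(n, delta, gap)` equals $\frac{32n^2}{\Delta^2}\log\big(\frac{32n^2\sqrt{n/\delta}}{\Delta^2}\big)$, with `gap` standing in for $\Delta$.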


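Step 2: For all $t$, $1 \in A_t$.

We showed above that the Borda reduction will never remove the best arm
from $A_t$. We now show that the sparse-structured discard condition will not
remove the best arm. At any time $t > T_0$, let $i \in [n] \setminus 1$ be
arbitrary and let
$\widehat{\Omega}_i = \arg\max_{\Omega \subset [n] : |\Omega| = k} \widehat{\nabla}_{i,1,t}(\Omega)$
and
$\Omega_i = \arg\max_{\Omega \subset [n] : |\Omega| = k} \nabla_{i,1}(\Omega)$.
Note that for any $\Omega \subset [n]$ we have
$\nabla_{i,1}(\Omega) = \nabla_{1,i}(\Omega)$ but
$\Delta_{i,1}(\Omega) = -\Delta_{1,i}(\Omega)$ and

$$\begin{aligned}
\widehat{\Delta}_{i,1,t}(\widehat{\Omega}_i)
&\le \Delta_{i,1}(\widehat{\Omega}_i) + C_t(\widehat{\Omega}_i) \\
&= \Delta_{i,1}(\widehat{\Omega}_i) - \Delta_{i,1}(\Omega_i) + \Delta_{i,1}(\Omega_i) + C_t(\widehat{\Omega}_i) \\
&= \Big( \sum_{\omega \in \Omega_i \setminus \widehat{\Omega}_i} (p_{1,\omega} - p_{i,\omega}) \Big)
 - \Big( \sum_{\omega \in \widehat{\Omega}_i \setminus \Omega_i} (p_{1,\omega} - p_{i,\omega}) \Big)
 - \Delta_{1,i}(\Omega_i) + C_t(\widehat{\Omega}_i).
\end{aligned}$$

By the conditions of the theorem,
$\big| \sum_{\omega \in \widehat{\Omega}_i \setminus \Omega_i} (p_{1,\omega} - p_{i,\omega}) \big| \le \nabla_{1,i}(\Omega_i^c) \le \frac{1}{3} \Delta_{1,i}(\Omega_i)$.
Continuing,

$$\begin{aligned}
\widehat{\Delta}_{i,1,t}(\widehat{\Omega}_i)
&\le \Big( \sum_{\omega \in \Omega_i \setminus \widehat{\Omega}_i} (p_{1,\omega} - p_{i,\omega}) \Big) - \tfrac{2}{3} \Delta_{1,i}(\Omega_i) + C_t(\widehat{\Omega}_i) \\
&\le \Big( \sum_{\omega \in \Omega_i \setminus \widehat{\Omega}_i} |\widehat{p}_{1,\omega,t} - \widehat{p}_{i,\omega,t}| \Big) - \tfrac{2}{3} \Delta_{1,i}(\Omega_i) + C_t(\widehat{\Omega}_i) + C_t(\Omega_i \setminus \widehat{\Omega}_i) \\
&\le \Big( \sum_{\omega \in \widehat{\Omega}_i \setminus \Omega_i} |\widehat{p}_{1,\omega,t} - \widehat{p}_{i,\omega,t}| \Big) - \tfrac{2}{3} \Delta_{1,i}(\Omega_i) + C_t(\widehat{\Omega}_i) + C_t(\Omega_i \setminus \widehat{\Omega}_i) \\
&\le \Big( \sum_{\omega \in \widehat{\Omega}_i \setminus \Omega_i} |p_{1,\omega} - p_{i,\omega}| \Big) - \tfrac{2}{3} \Delta_{1,i}(\Omega_i) + C_t(\widehat{\Omega}_i) + C_t(\Omega_i \setminus \widehat{\Omega}_i) + C_t(\widehat{\Omega}_i \setminus \Omega_i) \\
&\le -\tfrac{1}{3} \Delta_{1,i}(\Omega_i) + C_t(\widehat{\Omega}_i) + C_t(\Omega_i \setminus \widehat{\Omega}_i) + C_t(\widehat{\Omega}_i \setminus \Omega_i) \\
&\le 3 \max_{\Omega \subset [n] : |\Omega| \le k} C_t(\Omega) = 6(1 + k) C_t
\end{aligned}$$

where the third inequality follows from the fact that
$\widehat{\nabla}_{1,i,t}(\Omega_i \setminus \widehat{\Omega}_i) \le \widehat{\nabla}_{1,i,t}(\widehat{\Omega}_i \setminus \Omega_i)$
by definition, and the second-to-last line follows again by the same theorem
condition used above. Thus, combining both steps one and two, we have that
$1 \in A_t$ for all $t$.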


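Step 3: Sample Complexity

At any time $t > T_0$, let $j \in [n] \setminus 1$ be arbitrary and let
$\widehat{\Omega}_j = \arg\max_{\Omega \subset [n] : |\Omega| = k} \widehat{\nabla}_{1,j,t}(\Omega)$
and
$\Omega_j = \arg\max_{\Omega \subset [n] : |\Omega| = k} \nabla_{1,j}(\Omega)$.
We begin with

$$\begin{aligned}
\widehat{\Delta}_{1,j,t}(\widehat{\Omega}_j)
&\ge \Delta_{1,j}(\widehat{\Omega}_j) - C_t(\widehat{\Omega}_j) \\
&= \Delta_{1,j}(\widehat{\Omega}_j) - \Delta_{1,j}(\Omega_j) + \Delta_{1,j}(\Omega_j) - C_t(\widehat{\Omega}_j) \\
&= \Big( \sum_{\omega \in \widehat{\Omega}_j \setminus \Omega_j} (p_{1,\omega} - p_{j,\omega}) \Big)
 - \Big( \sum_{\omega \in \Omega_j \setminus \widehat{\Omega}_j} (p_{1,\omega} - p_{j,\omega}) \Big)
 + \Delta_{1,j}(\Omega_j) - C_t(\widehat{\Omega}_j) \\
&\ge -\Big( \sum_{\omega \in \Omega_j \setminus \widehat{\Omega}_j} |\widehat{p}_{1,\omega,t} - \widehat{p}_{j,\omega,t}| \Big) + \tfrac{2}{3} \Delta_{1,j}(\Omega_j) - C_t(\widehat{\Omega}_j) - C_t(\Omega_j \setminus \widehat{\Omega}_j) \\
&\ge -\Big( \sum_{\omega \in \widehat{\Omega}_j \setminus \Omega_j} |\widehat{p}_{1,\omega,t} - \widehat{p}_{j,\omega,t}| \Big) + \tfrac{2}{3} \Delta_{1,j}(\Omega_j) - C_t(\widehat{\Omega}_j) - C_t(\Omega_j \setminus \widehat{\Omega}_j) \\
&\ge -\Big( \sum_{\omega \in \widehat{\Omega}_j \setminus \Omega_j} |p_{1,\omega} - p_{j,\omega}| \Big) + \tfrac{2}{3} \Delta_{1,j}(\Omega_j) - C_t(\widehat{\Omega}_j) - C_t(\Omega_j \setminus \widehat{\Omega}_j) - C_t(\widehat{\Omega}_j \setminus \Omega_j) \\
&\ge \tfrac{1}{3} \Delta_{1,j}(\Omega_j) - 3 \max_{\Omega \subset [n] : |\Omega| \le k} C_t(\Omega)
 = \tfrac{1}{3} \Delta_{1,j}(\Omega_j) - 6(1 + k) C_t
\end{aligned}$$

by a series of steps analogous to those in Step 2. If $t_j$ is the first time
$t > T_0$ such that the right hand side is greater than or equal to
$6(1 + k) C_t$, the point at which $j$ would be removed, we have that

$$t_j \le \frac{20736\, n (k+1)^2}{\Delta_{1,j}^2(\Omega_j)} \log\Big( \frac{20736\, n^2 (k+1)^2}{\Delta_{1,j}^2(\Omega_j)\, \delta} \Big)$$
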

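using the same inequality as above in Step 1. Combining steps one and three
we have that the total number of samples taken is bounded by

$$\sum_{j > 1} \min\bigg\{ \max\bigg\{ T_0,\ \frac{20736\, n (k+1)^2}{\Delta_{1,j}^2(\Omega_j)} \log\Big( \frac{20736\, n^2 (k+1)^2}{\Delta_{1,j}^2(\Omega_j)\, \delta} \Big) \bigg\},\ \frac{32n^2}{\Delta_{1,j}^2([n])} \log\Big( \frac{32n^2 \sqrt{n/\delta}}{\Delta_{1,j}^2([n])} \Big) \bigg\}$$

with probability at least $1 - 3\delta$. The result follows from recalling that
$\frac{\Delta_{1,j}([n])}{n-1} = s_1 - s_j$ and noticing that
$\frac{n}{n-1} \le 2$ for $n \ge 2$. □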





